Main take-aways
- What is the aim of your model comparisons?
- Which marginal effects do you need to show to answer your research question?
- Use KHB method when needed
- Trade-offs of multinomial logistic regression
Week 3: More on categorical data
University of Oxford
Model fit & model comparison
Mediation in logit models
Multinomial regression
df <- read_csv(here("data", "bhps2001smoking.csv"))
dfclean <- drop_na(df)
dfclean$income <- dfclean$income/1000
head(df, 4)
# A tibble: 4 × 9
smoker primsec female age income married divsep widow nevermar
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 1 1 68 8409. 0 0 1 0
2 0 0 1 59 5596. 0 1 0 0
3 0 1 1 69 9814. 0 0 0 1
4 0 0 0 40 29710. 0 0 0 1
ols1 <- lm(income ~ age + female, data = dfclean)
ols2 <- lm(income ~ age + female + primsec, data = dfclean)
summary(ols1)
Call:
lm(formula = income ~ age + female, data = dfclean)
Residuals:
Min 1Q Median 3Q Max
-23.87 -6.78 -2.26 3.55 496.05
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.63013 0.65311 43.84 <2e-16 ***
age -0.18302 0.01146 -15.97 <2e-16 ***
female -8.47546 0.37266 -22.74 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.19 on 5871 degrees of freedom
Multiple R-squared: 0.1191, Adjusted R-squared: 0.1188
F-statistic: 396.9 on 2 and 5871 DF, p-value: < 2.2e-16
R-square: model variance as a share of total variance
Read it in the output or extract it from the OLS object
Residual standard error: the average deviation of observed from predicted values, in the original scale of y. An absolute measure, unlike the relative R-square.
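Both fit measures can be extracted directly from the fitted object. A minimal sketch with simulated data (variable names are illustrative, not the BHPS data):

```r
set.seed(1)
d <- data.frame(age = rnorm(200, 45, 10), female = rbinom(200, 1, 0.5))
d$income <- 30 - 0.2 * d$age - 8 * d$female + rnorm(200, sd = 14)
fit <- lm(income ~ age + female, data = d)

r2  <- summary(fit)$r.squared  # relative: share of variance explained
rse <- sigma(fit)              # absolute: average deviation, in units of y
# sigma() is sqrt(RSS / residual df):
isTRUE(all.equal(rse, sqrt(sum(resid(fit)^2) / df.residual(fit))))
```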
Is it worth adding education and then marital status?
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 31.4 0.746 42.1 0
2 age -0.164 0.0132 -12.4 4.57e- 35
3 female -8.24 0.369 -22.3 7.26e-106
4 primsec -6.87 0.380 -18.1 3.38e- 71
5 divsep 1.01 0.585 1.73 8.30e- 2
6 nevermar -1.74 0.545 -3.19 1.43e- 3
7 widow 3.08 0.691 4.46 8.20e- 6
Residual standard errors of the three models:
[1] 14.19261
[1] 13.82363
[1] 13.78746
We can use the F-test to compare nested models
Analysis of Variance Table
Model 1: income ~ age + female
Model 2: income ~ age + female + primsec
Model 3: income ~ age + female + primsec + divsep + nevermar + widow
Res.Df RSS Df Sum of Sq F Pr(>F)
1 5871 1182597
2 5870 1121715 1 60882 320.273 < 2.2e-16 ***
3 5867 1115282 3 6433 11.281 2.239e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Note: these two models are not nested — R computes the F-test anyway, but the result is not valid.)
Analysis of Variance Table
Model 1: income ~ smoker
Model 2: income ~ age + female + primsec
Res.Df RSS Df Sum of Sq F Pr(>F)
1 5872 1339197
2 5870 1121715 2 217482 569.05 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
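For genuinely nested models, the F statistic that anova() reports can be reproduced by hand from the residual sums of squares. A sketch on simulated data:

```r
set.seed(2)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- 1 + d$x1 + 0.5 * d$x2 + rnorm(300)
m_small <- lm(y ~ x1, data = d)
m_big   <- lm(y ~ x1 + x2, data = d)

rss_s <- sum(resid(m_small)^2)
rss_b <- sum(resid(m_big)^2)
# F = (drop in RSS per extra parameter) / (residual variance of the big model)
Fstat <- ((rss_s - rss_b) / 1) / (rss_b / df.residual(m_big))
isTRUE(all.equal(Fstat, anova(m_small, m_big)$F[2]))
```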
Call:
glm(formula = smoker ~ female + age + primsec + income + divsep +
widow + nevermar, family = binomial(link = "logit"), data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.927e-02 1.625e-01 0.488 0.6256
female -1.889e-01 6.965e-02 -2.712 0.0067 **
age -2.924e-02 2.483e-03 -11.777 < 2e-16 ***
primsec 5.690e-01 7.109e-02 8.004 1.20e-15 ***
income -1.724e-05 3.396e-06 -5.077 3.84e-07 ***
divsep 8.564e-01 9.292e-02 9.217 < 2e-16 ***
widow 2.604e-01 1.338e-01 1.946 0.0517 .
nevermar 5.231e-01 8.935e-02 5.855 4.78e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6425.8 on 5873 degrees of freedom
Residual deviance: 6041.8 on 5866 degrees of freedom
(6 observations deleted due to missingness)
AIC: 6057.8
Number of Fisher Scoring iterations: 4
Deviance = distance between actual fit and perfect fit
A perfect model predicts every case as 0 or 1 as observed; it has a deviance of 0
Deviance can range from 0 to infinity
Null model has an intercept only, describing the mean for all cases
This checks out:
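One way to verify it: for binary data the null deviance is just minus twice the Bernoulli log-likelihood evaluated at the overall mean. A sketch with simulated data:

```r
set.seed(3)
y  <- rbinom(500, 1, 0.3)
m0 <- glm(y ~ 1, family = binomial)  # intercept-only null model

pbar <- mean(y)
dev0 <- -2 * sum(y * log(pbar) + (1 - y) * log(1 - pbar))
isTRUE(all.equal(dev0, m0$null.deviance))
```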
Can we do better than this by adding independent variables?
Can we get closer to a saturated model? (i.e. \(n-1\) predictors plus an intercept: one parameter per observation)
Raw Residuals \(e_i= y_i - \hat{p}_i\) run from -1 to 1
Pearson Residual and Standardized Pearson Residuals (SPR)
Compare nested logistic models with a Likelihood Ratio \(\chi^2\) test (aka: LR, \(L^2\), \(G_m\))
m1 <- glm(smoker ~ primsec + age, family=binomial(link='logit'),
data = dfclean)
library(lmtest)
null <- glm(smoker ~ 1, family = binomial(link = 'logit'), data = dfclean)  # intercept-only model
lrtest(null, m1)
Likelihood ratio test
Model 1: smoker ~ 1
Model 2: smoker ~ primsec + age
#Df LogLik Df Chisq Pr(>Chisq)
1 1 -3212.9
2 3 -3085.3 2 255.15 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is easier to predict accurately with more variables
AIC/BIC penalise adding variables \(k\) (estimated parameters to be more precise)
Akaike Information Criterion: \(-2log(L_m) + 2k\)
Bayesian IC: \(-2log(L_m) + k * log(n)\)
Raftery 1995 on \(\Delta\)BIC: 0-2 weak, 2-6 positive, 6-10 strong, 10+ very strong
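These formulas can be checked against R's AIC() and BIC() on any fitted model; a sketch with simulated data:

```r
set.seed(4)
d <- data.frame(x = rnorm(400))
d$y <- rbinom(400, 1, plogis(-1 + d$x))
m <- glm(y ~ x, family = binomial, data = d)

ll <- as.numeric(logLik(m))
k  <- attr(logLik(m), "df")  # number of estimated parameters
n  <- nobs(m)
aic_hand <- -2 * ll + 2 * k
bic_hand <- -2 * ll + k * log(n)
c(isTRUE(all.equal(aic_hand, AIC(m))), isTRUE(all.equal(bic_hand, BIC(m))))
```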
Compare nested models with a Likelihood Ratio \(\chi^2\) test (aka: LR, \(L^2\), \(G_m\))
m2 <- glm(smoker ~ primsec + age + income, family=binomial(link='logit'),
data = dfclean)
lrtest(m1, m2)
Likelihood ratio test
Model 1: smoker ~ primsec + age
Model 2: smoker ~ primsec + age + income
#Df LogLik Df Chisq Pr(>Chisq)
1 3 -3085.3
2 4 -3073.6 1 23.442 1.287e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
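The LR statistic itself is just twice the difference in log-likelihoods, or equivalently the drop in deviance; a sketch on simulated data:

```r
set.seed(5)
d <- data.frame(x = rnorm(400), z = rnorm(400))
d$y <- rbinom(400, 1, plogis(-0.5 + d$x + 0.5 * d$z))
mA <- glm(y ~ x,     family = binomial, data = d)
mB <- glm(y ~ x + z, family = binomial, data = d)

lr <- 2 * (as.numeric(logLik(mB)) - as.numeric(logLik(mA)))
p  <- pchisq(lr, df = 1, lower.tail = FALSE)  # one extra parameter
isTRUE(all.equal(lr, deviance(mA) - deviance(mB)))
```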
Add \(z\) to a model with \(x\) and compare the change in \(\beta_x\) between the two models
How much of the effect of gender on income is due to education?
| (1) | (2) | |
|---|---|---|
| (Intercept) | 28.63 (0.65) <0.01 | 29.51 (0.64) <0.01 |
| age | -0.18 (0.01) <0.01 | -0.13 (0.01) <0.01 |
| female | -8.48 (0.37) <0.01 | -7.88 (0.36) <0.01 |
| primsec | -6.79 (0.38) <0.01 | |
| Num.Obs. | 5874 | 5874 |
| R2 Adj. | 0.119 | 0.164 |
How much of the effect of gender on income is due to education?
female
-8.48
female
-7.88
# Controlling for education closes the gap by
round(ols2$coefficients["female"] - ols1$coefficients["female"], 2)
female 
   0.6 
Or in relative terms: 0.6 / 8.48 ≈ 7% of the original gap
Let’s not do the same in logistic regression
🚨 Even if \(Z\) is unrelated to \(X\), \(\beta_x\) will change
In OLS, what happens when we add \(Z\)?
What happens to the variances in a logistic setting?
Write logistic regression as a latent variable model for \(y^*\)
\(y^*\) simply means that if \(y^* \geq 0\), \(y = 1\), and if \(y^* < 0\), \(y = 0\)
\[y^* = \beta_0 + \beta_1 X + \varepsilon\] where \(\varepsilon\) has a standard logistic distribution, with a mean of 0 and a variance of \(\pi^2/3 \approx 3.29\)
\[y^* = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon\]
\(var(y^*) = var(\beta_1 X) + \pi^2/3\)
\(var(y^*) = var(\beta_1 X + \beta_2 Z) + \pi^2/3\)
Add \(Z\):
- \(var(y^*)\) has to increase because the residual variance is fixed at \(\pi^2/3\), so the model variance goes up
- \(y^*\) is scaled differently in nested models
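The rescaling can be demonstrated by simulation: generate \(y\) from a latent model in which \(X\) and \(Z\) are independent, then compare \(\beta_x\) with and without \(Z\). A sketch (not the BHPS data):

```r
set.seed(6)
n <- 20000
x <- rnorm(n)
z <- rnorm(n)                # independent of x by construction
ystar <- x + z + rlogis(n)   # logistic error, variance pi^2/3
y <- as.integer(ystar >= 0)

b_reduced <- coef(glm(y ~ x,     family = binomial))["x"]
b_full    <- coef(glm(y ~ x + z, family = binomial))["x"]
# b_full > b_reduced: omitting z leaves its variance in the error term,
# which attenuates the coefficient on x even though x and z are unrelated
round(c(b_reduced, b_full), 2)
```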
Question: what happens when one assumes \(\varepsilon\) has a standard normal distribution? What variance would it then have?
How serious is the problem?
Huge, according to some, e.g. Mood (2009)
Context-dependent, or not that problematic, according to others, e.g. Kuha & Mills (2018)
Is your dependent variable binary or are you interested in a latent variable?
What is your estimand?
Approaches to address the scaling issue:
Compare coefficients between two models
Both models have the same fit and same residual variance
Compare X between the two models for the true impact of mediation X by Z
Also called the rescaling method. Packages exist for Stata and R (khb), but it is easy to do by hand
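A by-hand sketch of the idea on simulated data: replace the mediator \(Z\) with its residual from a regression on \(X\). The adjusted model then has exactly the same fit (and hence the same scale) as the full model, so its \(\beta_x\) is directly comparable:

```r
set.seed(7)
n <- 5000
x <- rbinom(n, 1, 0.5)
z <- 0.5 * x + rnorm(n)  # z is partly caused by x (a mediator)
y <- rbinom(n, 1, plogis(-0.5 + 0.5 * x + 0.7 * z))

r_z  <- residuals(lm(z ~ x))  # z purged of x
full <- glm(y ~ x + z,   family = binomial)
adj  <- glm(y ~ x + r_z, family = binomial)

# {1, x, r_z} spans the same space as {1, x, z}: identical fit
isTRUE(all.equal(deviance(full), deviance(adj), tolerance = 1e-6))
# KHB: total effect (adjusted) minus direct effect (full), on a common scale
coef(adj)["x"] - coef(full)["x"]
```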
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1.42 0.0512 -27.8 4.27e-170
2 primsec 0.410 0.0641 6.39 1.61e- 10
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.0594 0.107 -0.556 5.78e- 1
2 primsec 0.659 0.0674 9.78 1.40e-22
3 age -0.0300 0.00213 -14.1 6.08e-45
models <- list(nlm_reduced,nlm_full)
modelsummary(models, fmt = 2,
statistic = NULL,
estimate = "{estimate} ({std.error}) {p.value}",
gof_map = c("nobs", "aic", "bic"))
| (1) | (2) | |
|---|---|---|
| (Intercept) | -1.42 (0.05) <0.01 | -0.06 (0.11) 0.58 |
| primsec | 0.41 (0.06) <0.01 | 0.66 (0.07) <0.01 |
| age | -0.03 (0.00) <0.01 | |
| Num.Obs. | 5874 | 5874 |
| AIC | 6388.0 | 6176.7 |
| BIC | 6401.4 | 6196.7 |
ORs of 1.507 and 1.932
## r_age: age purged of primsec (its residual from a regression on primsec)
dfclean$r_age <- residuals(lm(age ~ primsec, data = dfclean))
nlm_adj <- glm(smoker ~ primsec + r_age,
family=binomial(link='logit'), data = dfclean)
tidy(nlm_adj)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1.48 0.0523 -28.2 3.52e-175
2 primsec 0.405 0.0651 6.22 5.03e- 10
3 r_age -0.0300 0.00213 -14.1 6.08e- 45
Compare the adjusted coefficient on primsec with the full model:
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.0594 0.107 -0.556 5.78e- 1
2 primsec 0.659 0.0674 9.78 1.40e-22
3 age -0.0300 0.00213 -14.1 6.08e-45
| (1) | (2) | (3) | |
|---|---|---|---|
| (Intercept) | -1.42 (0.05) <0.01 | -0.06 (0.11) 0.58 | -1.48 (0.05) <0.01 |
| primsec | 0.41 (0.06) <0.01 | 0.66 (0.07) <0.01 | 0.40 (0.07) <0.01 |
| age | | -0.03 (0.00) <0.01 | |
| r_age | | | -0.03 (0.00) <0.01 |
| Num.Obs. | 5874 | 5874 | 5874 |
| AIC | 6388.0 | 6176.7 | 6176.7 |
| BIC | 6401.4 | 6196.7 | 6196.7 |
Reduced Adjusted Full
(Intercept) -1.42255 *** -1.47509 *** -0.05936
(0.051170) (0.052275) (0.106759)
primsec 0.40986 *** 0.40487 *** 0.65862 ***
(0.064095) (0.065112) (0.067359)
Resid(age) -0.03000 ***
(0.002133)
age -0.03000 ***
(0.002133)
Pseudo R2 0.010664 0.06391 0.06391
Dev. 6384 6170.7 6170.7
Null 6425.8 6425.8 6425.8
Chisq 41.809 255.15 255.15
Sig *** *** ***
Df 1 2 2
BIC 6392.7 6188 6188
KHB method
Model type: glm lm (logit)
Variables of interest: primsec
Z variables (mediators): age
Summary of confounding
Ratio Percentage Rescaling
primsec 0.61473 -62.67385 0.9878
------------------------------------------------------------------
primsec :
Estimate Std. Error z value Pr(>|z|)
Reduced 0.404871 0.065112 6.2181 5.033e-10 ***
Full 0.658620 0.067359 9.7777 < 2.2e-16 ***
Diff -0.253748 0.021896 -11.5890 < 2.2e-16 ***
## estimate full logit model and save predicted logodds as V
dfclean$v <- predict(nlm_full, type = "link")
## now fit the substantive models as OLS with V as dependent
rkhb_full <- lm(v ~ primsec + age, data = dfclean)
summary(rkhb_full) # look at that R^2! :-)
Call:
lm(formula = v ~ primsec + age, data = dfclean)
Residuals:
Min 1Q Median 3Q Max
-8.759e-13 3.000e-17 1.700e-16 3.600e-16 8.500e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.936e-02 5.244e-16 -1.132e+14 <2e-16 ***
primsec 6.586e-01 3.244e-16 2.031e+15 <2e-16 ***
age -3.000e-02 9.886e-18 -3.035e+15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.184e-14 on 5871 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 5.44e+30 on 2 and 5871 DF, p-value: < 2.2e-16
Binomial logistic regression:
\[logit(p) = log \frac {p} {1-p} = \beta_0 + \beta_1x_1 + ... + \beta_kx_k\]
\(k\) independent variables
Multinomial logistic regression:
\[logit(p) = log \frac {P(y_i = m)} {P(y_i = ref)} = \beta_{0m} + \beta_{1m}x_1 + ... + \beta_{km}x_k\] a regression model with \(k\) independent variables, fitted for each non-reference category \(m\), when \(y\) has \(M\) categories (more than two)
\[p(y_i = 1) + p(y_i = 2) + ... + p(y_i= M) = 1 \]
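The constraint that the \(M\) predicted probabilities sum to 1 is easy to verify with nnet::multinom; a sketch on simulated data:

```r
library(nnet)  # multinom()
set.seed(8)
d <- data.frame(x = rnorm(300))
d$cat <- factor(sample(c("a", "b", "c"), 300, replace = TRUE))

m <- multinom(cat ~ x, data = d, trace = FALSE)
p <- predict(m, type = "probs")  # one column per category
range(rowSums(p))                # every row sums to 1
```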
dfclean$marital <- 1
dfclean[dfclean$divsep==1,]$marital <- 2
dfclean[dfclean$widow==1,]$marital <- 3
dfclean[dfclean$nevermar==1,]$marital <- 4
dfclean$marital <- as.factor(dfclean$marital)
levels(dfclean$marital) <- c("married","div/sep","widow", "single")
prop.table(table(dfclean$marital))
married div/sep widow single
0.64589717 0.11201907 0.09805924 0.14402451
Four outcomes, a system of three equations:
\[log \frac {P(marstat = divsep)} {P(marstat = married)} = \beta_{10} + \beta_{11}*age + \beta_{12}*female + \beta_{13}*primsec\]
\[log \frac {P(marstat = widow)} {P(marstat = married)} = \beta_{20} + \beta_{21}*age + \beta_{22}*female + \beta_{23}*primsec\] \[log \frac {P(marstat = single)} {P(marstat = married)} = \beta_{30} + \beta_{31}*age + \beta_{32}*female + \beta_{33}*primsec\]
library(nnet)
mnom <- multinom(marital ~ age + female + primsec, data =dfclean, trace = FALSE)
summary(mnom)
Call:
multinom(formula = marital ~ age + female + primsec, data = dfclean,
trace = FALSE)
Coefficients:
(Intercept) age female primsec
div/sep -1.386832 -0.01222036 0.4430925 -0.011088953
widow -10.615869 0.11566742 1.3578659 0.575033172
single 1.596513 -0.06436850 -0.2985046 -0.003461308
Std. Errors:
(Intercept) age female primsec
div/sep 0.1659574 0.003049623 0.08790538 0.08756026
widow 0.3690083 0.004810113 0.12152906 0.13760496
single 0.1563999 0.003432239 0.07987463 0.08058662
Residual Deviance: 10201.01
AIC: 10225.01
tidy(mnom) %>% filter(term == "female") %>%
mutate(OR = exp(estimate),
lowerci = exp(estimate-1.96*std.error),
upperci = exp(estimate+1.96*std.error))
# A tibble: 3 × 9
y.level term estimate std.error statistic p.value OR lowerci upperci
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 div/sep female 0.443 0.0879 5.04 4.64e- 7 1.56 1.31 1.85
2 widow female 1.36 0.122 11.2 5.52e-29 3.89 3.06 4.93
3 single female -0.299 0.0799 -3.74 1.86e- 4 0.742 0.634 0.868
Or cleaned up:
# A tibble: 3 × 4
y.level OR lowerci upperci
<chr> <dbl> <dbl> <dbl>
1 div/sep 1.56 1.31 1.85
2 widow 3.89 3.06 4.93
3 single 0.742 0.634 0.868
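These ORs and intervals come straight from the log-odds scale by exponentiation; for instance, the div/sep row can be reproduced by hand from its estimate and standard error:

```r
est <- 0.443   # log-odds coefficient for female, div/sep vs married
se  <- 0.0879
OR      <- exp(est)
lowerci <- exp(est - 1.96 * se)
upperci <- exp(est + 1.96 * se)
round(c(OR = OR, lowerci = lowerci, upperci = upperci), 2)
```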
| (1) | |||
|---|---|---|---|
| div/sep | widow | single | |
| (Intercept) | 0.25 | 0.00 | 4.94 |
| [0.18, 0.35] | [0.00, 0.00] | [3.63, 6.71] | |
| age | 0.99 | 1.12 | 0.94 |
| [0.98, 0.99] | [1.11, 1.13] | [0.93, 0.94] | |
| female | 1.56 | 3.89 | 0.74 |
| [1.31, 1.85] | [3.06, 4.93] | [0.63, 0.87] | |
| primsec | 0.99 | 1.78 | 1.00 |
| [0.83, 1.17] | [1.36, 2.33] | [0.85, 1.17] | |
| Num.Obs. | 5874 | ||
| R2 | 0.161 | ||
| R2 Adj. | 0.160 | ||
| AIC | 10225.0 | ||
| BIC | 10305.2 | ||
| RMSE | 0.34 | ||
dfclean$marital <- relevel(dfclean$marital, ref = "div/sep")
mnom <- multinom(marital ~ age + female + primsec, data =dfclean, trace = FALSE)
modelsummary(mnom, shape = term ~ response:model, fmt = 2,
statistic = 'conf.int', conf_level = 0.95, exponentiate = TRUE)
| (1) | |||
|---|---|---|---|
| married | widow | single | |
| (Intercept) | 4.00 | 0.00 | 19.75 |
| [2.89, 5.54] | [0.00, 0.00] | [13.13, 29.72] | |
| age | 1.01 | 1.14 | 0.95 |
| [1.01, 1.02] | [1.12, 1.15] | [0.94, 0.96] | |
| female | 0.64 | 2.50 | 0.48 |
| [0.54, 0.76] | [1.88, 3.31] | [0.39, 0.59] | |
| primsec | 1.01 | 1.80 | 1.01 |
| [0.85, 1.20] | [1.32, 2.44] | [0.82, 1.25] | |
| Num.Obs. | 5874 | ||
| R2 | 0.161 | ||
| R2 Adj. | 0.160 | ||
| AIC | 10225.0 | ||
| BIC | 10305.2 | ||
| RMSE | 0.34 | ||
# A tibble: 9 × 6
y.level term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 married (Intercept) 1.39 0.165 8.39 4.93e- 17
2 married age 0.0123 0.00298 4.13 3.70e- 5
3 married female -0.442 0.0876 -5.05 4.50e- 7
4 widow (Intercept) -9.07 0.390 -23.3 8.39e-120
5 widow age 0.132 0.00543 24.3 5.37e-130
6 widow female 0.955 0.143 6.66 2.71e- 11
7 single (Intercept) 2.98 0.207 14.4 7.36e- 47
8 single age -0.0521 0.00421 -12.4 3.48e- 35
9 single female -0.740 0.108 -6.85 7.45e- 12
| (1) | |||
|---|---|---|---|
| married | widow | single | |
| (Intercept) | 4.01 | 0.00 | 19.75 |
| [2.90, 5.54] | [0.00, 0.00] | [13.15, 29.65] | |
| age | 1.01 | 1.14 | 0.95 |
| [1.01, 1.02] | [1.13, 1.15] | [0.94, 0.96] | |
| female | 0.64 | 2.60 | 0.48 |
| [0.54, 0.76] | [1.96, 3.44] | [0.39, 0.59] | |
| Num.Obs. | 5874 | ||
| R2 | 0.159 | ||
| R2 Adj. | 0.159 | ||
| AIC | 10237.7 | ||
| BIC | 10297.8 | ||
| RMSE | 0.34 | ||
## mn2: the multinomial model without primsec (matches the output above)
mn2 <- multinom(marital ~ age + female, data = dfclean, trace = FALSE)
dm <- data.frame(female = rep(c(0,1), each=41), age = rep(c(30:70),2))
pdm <- cbind(dm, predict(mn2, newdata = dm, type = "probs", se = TRUE))
library(reshape2)
lpdm <- melt(pdm, id.vars = c("female", "age"), value.name = "probability")
ggplot(lpdm, aes(x = age, y = probability,
colour = variable, linetype = as.factor(female))) +
geom_line() +
theme_minimal()
What you need to cover before the lecture and lab:
revise basics of OLS, variance components, interactions
practice in R
read relevant part of Snijders & Bosker!
Think about applications of multilevel models in your own research and areas of interest